# Learn more about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.html
import pandas as pd

# Load the dataset with pandas from the URL
df = pd.read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/star-wars-survey/StarWars.csv", encoding="ISO-8859-1")
Elevator pitch
In this project, I analyzed Star Wars survey data to explore fan demographics. The Empire Strikes Back was the most beloved film, and most respondents think that Han shot first. I also built a machine learning model to predict whether a person earns more than $50K, which reached 60% accuracy.
QUESTION|TASK 1
Shorten the column names and clean them up for easier use with pandas. Provide a table or list that exemplifies how you fixed the names.
In this portion, I cleaned up the column names. They are now all lowercase, with an underscore between each word and all other punctuation removed.
import re

# Trim, lowercase, replace spaces with underscores, then drop any remaining punctuation
df.columns = [re.sub(r'[^a-zA-Z0-9_]', '', c.strip().lower().replace(" ", "_")) for c in df.columns]
df.columns[:10]
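To exemplify the renaming, the same rule can be applied to a few raw headers and printed side by side. This is a minimal sketch: `clean_name` is a hypothetical helper that wraps the expression used above, and the sample headers are assumed to resemble those in the survey CSV rather than copied from it.

```python
import re


def clean_name(name: str) -> str:
    """Trim, lowercase, turn spaces into underscores, then drop other punctuation."""
    return re.sub(r"[^a-zA-Z0-9_]", "", name.strip().lower().replace(" ", "_"))


# Sample headers assumed to resemble those in the survey file
raw_headers = [
    "RespondentID",
    "Have you seen any of the 6 films in the Star Wars franchise?",
    "Household Income",
    "Location (Census Region)",
]

# Map each raw header to its cleaned form and print the pairs
renamed = {raw: clean_name(raw) for raw in raw_headers}
for raw, cleaned in renamed.items():
    print(f"{raw!r} -> {cleaned!r}")
```

Under these assumed headers, the location column comes out as `location_census_region` rather than `location`, which matters for any later cell that looks columns up by name.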
QUESTION|TASK 2
Clean and format the data so that it can be used in a machine learning model. As you format the data, you should complete each item listed below. In your final report provide example(s) of the reformatted data with a short description of the changes made.
a. Filter the dataset to respondents that have seen at least one film
b. Create a new column that converts the age ranges to a single number. Drop the age range categorical column
c. Create a new column that converts the education groupings to a single number. Drop the school categorical column
d. Create a new column that converts the income ranges to a single number. Drop the income range categorical column
e. Create your target (also known as “y” or “label”) column based on the new income range column
f. One-hot encode all remaining categorical columns
# Keep only respondents who have seen at least one film; .copy() avoids chained-assignment warnings in later cells
df = df[df['have_you_seen_any_of_the_6_films_in_the_star_wars_franchise'] == 'Yes'].copy()
# Map each age range to a single representative number, then drop the categorical column
age_map = {"18-29": 24, "30-44": 37, "45-60": 52, "> 60": 65}
df['age_num'] = df['age'].map(age_map)
df.drop(columns=['age'], inplace=True)
# Map each education grouping to an ordinal number, then drop the categorical column
edu_map = {
    "Less than high school degree": 1,
    "High school degree": 2,
    "Some college or Associate degree": 3,
    "Bachelor degree": 4,
    "Graduate degree": 5,
}
df['education_num'] = df['education'].map(edu_map)
df.drop(columns=['education'], inplace=True)
# Map each income range to a single representative number, then drop the categorical column
income_map = {
    "$0 - $24,999": 12500,
    "$25,000 - $49,999": 37500,
    "$50,000 - $99,999": 75000,
    "$100,000 - $149,999": 125000,
    "$150,000+": 175000,
}
df['income_num'] = df['household_income'].map(income_map)
df.drop(columns=['household_income'], inplace=True)
# Target column: 1 if the income number is above $50,000, else 0 (rows with missing income also become 0)
df['income_gt_50k'] = (df['income_num'] > 50000).astype(int)
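One detail worth checking in the target step: a comparison against a missing value evaluates to False in pandas, so respondents with no reported income are labeled 0 alongside those earning under $50k. The sketch below uses a toy frame with hypothetical values to show that behavior, along with one possible alternative of dropping those rows first. Whether to drop them is a modeling choice that this report does not state.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing income (values are illustrative only)
toy = pd.DataFrame({"income_num": [12500, 75000, np.nan, 175000]})

# Direct comparison: NaN > 50000 is False, so the missing row becomes 0
toy["income_gt_50k"] = (toy["income_num"] > 50000).astype(int)
print(toy)

# Alternative: drop rows with unknown income before building the label
known = toy.dropna(subset=["income_num"]).copy()
known["income_gt_50k"] = (known["income_num"] > 50000).astype(int)
print(known)
```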
# One-hot encode the listed categorical columns that are present in the frame
categorical_cols = [col for col in ['gender', 'location'] if col in df.columns]
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
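The cell above encodes only `gender` and `location`, and only when those names are present. As a sketch of how all remaining categorical columns could be picked up automatically, the example below selects every non-numeric column on a toy frame. The column names and values here are hypothetical stand-ins, not the survey data.

```python
import pandas as pd

# Toy frame: one numeric column and two text columns (illustrative values)
toy = pd.DataFrame({
    "age_num": [24, 37, 52],
    "gender": ["Male", "Female", "Female"],
    "shot_first": ["Han", "Greedo", "Han"],
})

# Treat every column that is neither numeric nor boolean as categorical
categorical_cols = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()

# drop_first=True drops one level per column to avoid redundant indicators
encoded = pd.get_dummies(toy, columns=categorical_cols, drop_first=True)
print(encoded.columns.tolist())
```

Selecting by dtype avoids silently skipping a column whose cleaned name differs from the one hard-coded in the list.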
QUESTION|TASK 3
Validate that the data provided on GitHub lines up with the article by recreating 2 of the visuals from the article.
I used plotnine's ggplot to recreate the “Who Shot First” graph and the “What's the best Star Wars movie?” graph.
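As a rough illustration of how a chart like “Who Shot First” could be rebuilt, the sketch below computes the percentage share of each answer and defines a plotnine bar chart for it. The toy responses are made up, the column name `which_character_shot_first` is an assumption about the cleaned header, and `plot_shares` is a hypothetical helper, not the code used for this report.

```python
import pandas as pd

# Toy responses; the column name is assumed to match the cleaned survey header
toy = pd.DataFrame({
    "which_character_shot_first": [
        "Han", "Greedo", "Han", "I don't understand this question", "Han",
    ]
})

# Percentage share of each answer, as a two-column frame
shares = (
    toy["which_character_shot_first"]
    .value_counts(normalize=True)
    .mul(100)
    .round(0)
    .rename_axis("answer")
    .reset_index(name="percent")
)
print(shares)


def plot_shares(data: pd.DataFrame):
    """Build a horizontal bar chart of answer shares with plotnine."""
    # Imported here so the summary above runs even without plotnine installed
    from plotnine import ggplot, aes, geom_col, coord_flip, labs

    return (
        ggplot(data, aes(x="answer", y="percent"))
        + geom_col()
        + coord_flip()
        + labs(title="Who Shot First?", x="", y="Percent of respondents")
    )
```

Calling `plot_shares(shares)` in a notebook cell would render the chart. Comparing the computed percentages against the article's figures is what validates the data.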
Build a machine learning model that predicts whether a person makes more than $50k, with accuracy of at least 65%. Describe your model and report the accuracy.
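A minimal sketch of how such a model might be trained and scored, assuming scikit-learn is available. The data below is synthetic and made up purely for illustration, the choice of a random forest is an assumption, and this is not the model behind the 60% figure reported in the elevator pitch.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: these values are made up for illustration only
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age_num": rng.choice([24, 37, 52, 65], size=n),
    "education_num": rng.integers(1, 6, size=n),
})
# Made-up label loosely tied to education so the model has something to learn
toy["income_gt_50k"] = (toy["education_num"] + rng.normal(0, 1, size=n) > 3).astype(int)

# Features exclude the target column
X = toy.drop(columns=["income_gt_50k"])
y = toy["income_gt_50k"]

# Hold out 25% of rows for evaluation, keeping the class balance similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy on the toy hold-out set: {accuracy:.2f}")
```

With the real survey frame, `income_num` would also need to be left out of the features, since the target is derived directly from it.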
STRETCH QUESTION|TASK 2
Validate that the data provided on GitHub lines up with the article by recreating a 3rd visual from the article.
# Import plotnine and confirm that it is available
from plotnine import *
print("plotnine is installed and ready!")
plotnine is installed and ready!
STRETCH QUESTION|TASK 3
Create a new column that converts the location groupings to a single number. Drop the location categorical column.
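One way this could be approached is to assign each region an integer code and then drop the text column. The sketch below does this on a toy frame. The column name `location_census_region` and the region labels are assumptions about the cleaned survey data, and the codes are arbitrary since census regions have no natural order.

```python
import pandas as pd

# Toy rows; column name and region labels are assumed, not taken from the survey
toy = pd.DataFrame({
    "location_census_region": ["Pacific", "New England", None, "Pacific"]
})

# Arbitrary integer code per region, assigned in alphabetical order
regions = sorted(toy["location_census_region"].dropna().unique())
location_map = {region: code for code, region in enumerate(regions, start=1)}

# Missing regions stay missing after the mapping
toy["location_num"] = toy["location_census_region"].map(location_map)
toy = toy.drop(columns=["location_census_region"])
print(toy)
```

Because the codes imply an ordering that does not exist, a tree-based model is likely to cope with them better than a linear one.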